Table of Contents

  • 练习1. 绘制水平柱状图 Profit by Sub-Category
  • 练习2. 绘制堆积柱状图 Profit by Sub-Category & Segment
  • 练习3. 绘制散点图 Sales and Profit Distrubution
  • 练习4. 绘制气泡图 Sales and Profit Distrubution
  • 练习5. 绘制时间序列图 Sales and Profit by day
  • 练习6. 绘制时间序列图 Sales and Profit by month

练习1. 绘制水平柱状图 Profit by Sub-Category¶

导入数据文件'Sample - Superstore.xls',绘制水平柱状图,展示每个商品子类别(Sub-Category)的利润(Profit),并根据利润大小显示颜色。

In [1]:
# Step1. 导入数据文件'Sample - Superstore.xls',并查看数据
import pandas as pd
df = pd.read_excel('./Sample - Superstore.xls',sheet_name='Orders')
df.head()
Out[1]:
Row ID Order ID Order Date Ship Date Ship Mode Customer ID Customer Name Segment Country City ... Postal Code Region Product ID Category Sub-Category Product Name Sales Quantity Discount Profit
0 1 CA-2017-152156 2017-11-08 2017-11-11 Second Class CG-12520 Claire Gute Consumer United States Henderson ... 42420.0 South FUR-BO-10001798 Furniture Bookcases Bush Somerset Collection Bookcase 261.9600 2 0.00 41.9136
1 2 CA-2017-152156 2017-11-08 2017-11-11 Second Class CG-12520 Claire Gute Consumer United States Henderson ... 42420.0 South FUR-CH-10000454 Furniture Chairs Hon Deluxe Fabric Upholstered Stacking Chairs,... 731.9400 3 0.00 219.5820
2 3 CA-2017-138688 2017-06-12 2017-06-16 Second Class DV-13045 Darrin Van Huff Corporate United States Los Angeles ... 90036.0 West OFF-LA-10000240 Office Supplies Labels Self-Adhesive Address Labels for Typewriters b... 14.6200 2 0.00 6.8714
3 4 US-2016-108966 2016-10-11 2016-10-18 Standard Class SO-20335 Sean O'Donnell Consumer United States Fort Lauderdale ... 33311.0 South FUR-TA-10000577 Furniture Tables Bretford CR4500 Series Slim Rectangular Table 957.5775 5 0.45 -383.0310
4 5 US-2016-108966 2016-10-11 2016-10-18 Standard Class SO-20335 Sean O'Donnell Consumer United States Fort Lauderdale ... 33311.0 South OFF-ST-10000760 Office Supplies Storage Eldon Fold 'N Roll Cart System 22.3680 2 0.20 2.5164

5 rows × 21 columns

In [2]:
# Step2. 数据分析:对Sub-Category进行分组,再对每组的Profit求和(数据聚合与分组操作——GroupBy机制)
data = df.groupby('Sub-Category')['Profit'].sum()
# df['Profit'].groupby(df['Sub-Category']).sum()  # 另一种写法
print(data)  # 返回Series:左侧是索引index,右侧是值values
print('\n',data.index)  # 使用.index属性查看索引
print('\n',data.values)  # 使用.values属性或者to_numpy()方法查看值(返回数组)
Sub-Category
Accessories    41936.6357
Appliances     18138.0054
Art             6527.7870
Binders        30221.7633
Bookcases      -3472.5560
Chairs         26590.1663
Copiers        55617.8249
Envelopes       6964.1767
Fasteners        949.5182
Furnishings    13059.1436
Labels          5546.2540
Machines        3384.7569
Paper          34053.5693
Phones         44515.7306
Storage        21278.8264
Supplies       -1189.0995
Tables        -17725.4811
Name: Profit, dtype: float64

 Index(['Accessories', 'Appliances', 'Art', 'Binders', 'Bookcases', 'Chairs',
       'Copiers', 'Envelopes', 'Fasteners', 'Furnishings', 'Labels',
       'Machines', 'Paper', 'Phones', 'Storage', 'Supplies', 'Tables'],
      dtype='object', name='Sub-Category')

 [ 41936.6357  18138.0054   6527.787   30221.7633  -3472.556   26590.1663
  55617.8249   6964.1767    949.5182  13059.1436   5546.254    3384.7569
  34053.5693  44515.7306  21278.8264  -1189.0995 -17725.4811]
In [3]:
# Step3. 绘制水平柱状图
import plotly.graph_objects as go

fig = go.Figure(data=go.Bar(
    y=data.index,
    x=data.values,
    marker=dict(
        color=data.values,
        colorscale=[[0,'#fb8c00'],[0.25,'white'],[1,'#1565c0']],
        showscale=True),
    orientation='h'
))

fig.update_layout(
    title='Profit by Sub-Category',
    plot_bgcolor='white',  
    xaxis=dict(title='Profit',gridcolor='rgba(0,0,0,0.05)'),
    yaxis=dict(title='Sub-Category')
)

fig.show()
In [4]:
# 思考:如何设置一个利润阈值?柱形的颜色由是否超过阈值来决定

thred = 25000
colors=['#BAB0AC']*data.shape[0]
for i in range(data.shape[0]):
    if data[i]>thred:
        colors[i]='#E15759'

fig = go.Figure(data=go.Bar(
    y=data.index,
    x=data.values,
    marker=dict(color=colors),
    orientation='h'
))

fig.update_layout(
    title='Profit by Sub-Category',
    plot_bgcolor='white',  
    xaxis=dict(title='Profit',gridcolor='rgba(0,0,0,0.05)'),
    yaxis=dict(title='Sub-Category'),
)

fig.show()

练习2. 绘制堆积柱状图 Profit by Sub-Category & Segment¶

导入数据文件'Sample - Superstore.xls',使用堆积柱状图展示每个商品子类别(Sub-Category)中,不同细分客户(Segment)的利润(Profit)。

In [5]:
# Step1. 导入数据文件'Sample - Superstore.xls',并查看数据
import pandas as pd
df = pd.read_excel('./Sample - Superstore.xls',sheet_name='Orders')
df.shape,df.columns
Out[5]:
((9994, 21),
 Index(['Row ID', 'Order ID', 'Order Date', 'Ship Date', 'Ship Mode',
        'Customer ID', 'Customer Name', 'Segment', 'Country', 'City', 'State',
        'Postal Code', 'Region', 'Product ID', 'Category', 'Sub-Category',
        'Product Name', 'Sales', 'Quantity', 'Discount', 'Profit'],
       dtype='object'))
In [6]:
# Step2. 数据分析:从绘图角度来思考,一个图形fig中有三个trace(Segment),每个trace对应多个柱形(Sub-Category)
# 对Segment和Sub-Category两个指标进行分组,再对每组的Profit求和(数据聚合与分组操作——GroupBy机制)
data = df.groupby(['Segment','Sub-Category'])['Profit'].sum()
# df['Profit'].groupby([df['Segment'],df['Sub-Category']]).sum()  # 另一种写法
data
Out[6]:
Segment      Sub-Category
Consumer     Accessories     20735.9225
             Appliances       6981.9282
             Art              3454.3011
             Binders         17995.5972
             Bookcases       -4435.6382
             Chairs          13235.3319
             Copiers         24083.7106
             Envelopes        3264.4126
             Fasteners         576.8008
             Furnishings      7919.4227
             Labels           3075.9884
             Machines         2141.0618
             Paper           15534.6436
             Phones          23837.1147
             Storage          7104.2004
             Supplies        -1657.5513
             Tables          -9728.0378
Corporate    Accessories     12707.4805
             Appliances       7429.8952
             Art              2004.6477
             Binders          6377.3201
             Bookcases         638.4502
             Chairs           8344.6565
             Copiers         18990.2789
             Envelopes        2571.2290
             Fasteners         251.9030
             Furnishings      3508.2077
             Labels           1760.8273
             Machines          703.0190
             Paper           10361.5468
             Phones          11766.2196
             Storage          9131.0247
             Supplies          338.9264
             Tables          -4906.4986
Home Office  Accessories      8493.2327
             Appliances       3726.1820
             Art              1068.8382
             Binders          5848.8460
             Bookcases         324.6320
             Chairs           5010.1779
             Copiers         12543.8354
             Envelopes        1128.5351
             Fasteners         120.8144
             Furnishings      1631.5132
             Labels            709.4383
             Machines          540.6761
             Paper            8157.3789
             Phones           8912.3963
             Storage          5043.6013
             Supplies          129.5254
             Tables          -3090.9447
Name: Profit, dtype: float64
In [7]:
# 得到的结果是一个MultiIndex(多重索引)的Series:左侧是两层索引index,右侧是值values
# data.index  # 使用index属性查看多重索引
print(data.index.levels[0])  # 查看外层索引
print(data.index.levels[1])  # 查看内层索引
print(data['Consumer'])  # 指定任一外层索引,查看内层Series
Index(['Consumer', 'Corporate', 'Home Office'], dtype='object', name='Segment')
Index(['Accessories', 'Appliances', 'Art', 'Binders', 'Bookcases', 'Chairs',
       'Copiers', 'Envelopes', 'Fasteners', 'Furnishings', 'Labels',
       'Machines', 'Paper', 'Phones', 'Storage', 'Supplies', 'Tables'],
      dtype='object', name='Sub-Category')
Sub-Category
Accessories    20735.9225
Appliances      6981.9282
Art             3454.3011
Binders        17995.5972
Bookcases      -4435.6382
Chairs         13235.3319
Copiers        24083.7106
Envelopes       3264.4126
Fasteners        576.8008
Furnishings     7919.4227
Labels          3075.9884
Machines        2141.0618
Paper          15534.6436
Phones         23837.1147
Storage         7104.2004
Supplies       -1657.5513
Tables         -9728.0378
Name: Profit, dtype: float64
In [8]:
# Step3. 绘制堆积柱状图
import plotly.graph_objects as go

fig=go.Figure()

# 使用for循环来遍历segment,绘制多个trace
for i in data.index.levels[0]: 
    fig.add_trace(
        go.Bar(
            x=data[i].index,  # 当前segment对应Series的索引
            y=data[i].values,  # 当前segment对应Series的值
            name=i  # 当前segment名称
        ))
    
fig.update(layout=dict(
    title='Profit by Sub-Category & Segment',
    barmode='stack',
))

fig.show()
In [9]:
# 思考:如何自定义颜色?

fig=go.Figure()

colors=['#76B7B2','#BAB0AC','#FF9DA7']  # 自定义颜色列表(离散值)

# 使用for循环来遍历segment,绘制多个trace
for i,c in zip(data.index.levels[0],colors):   # 使用zip()函数同时遍历两个序列
    fig.add_trace(
        go.Bar(
            x=data[i].index,  # 当前segment对应Series的索引
            y=data[i].values,  # 当前segment对应Series的值
            name=i,  # 当前segment名称
            marker=dict(color=c)  # 将离散颜色值传递给color参数
        ))

fig.update(layout=dict(
    title='Profit by Sub-Category & Segment',
    barmode='stack',
))

fig.show()

练习3. 绘制散点图 Sales and Profit Distrubution¶

导入数据文件'Sample - Superstore.xls',绘制散点图,展示商品子类别(Sub-Category)中'Paper'销售额(Sales)和利润(Profit)的相关关系,交互时文本显示折扣(Discount),从而进一步分析这些变量之间的关系。

In [10]:
# Step1. 导入文件'Sample - Superstore.xls',对'Sub-Category'中'Paper'产品进行筛选
import pandas as pd
df = pd.read_excel('Sample - Superstore.xls',sheet_name='Orders')
data = df[df['Sub-Category']=='Paper']  # 布尔值索引
data.head()
Out[10]:
Row ID Order ID Order Date Ship Date Ship Mode Customer ID Customer Name Segment Country City ... Postal Code Region Product ID Category Sub-Category Product Name Sales Quantity Discount Profit
12 13 CA-2018-114412 2018-04-15 2018-04-20 Standard Class AA-10480 Andrew Allen Consumer United States Concord ... 28027.0 South OFF-PA-10002365 Office Supplies Paper Xerox 1967 15.552 3 0.2 5.4432
34 35 CA-2018-107727 2018-10-19 2018-10-23 Second Class MA-17560 Matt Abelman Home Office United States Houston ... 77095.0 Central OFF-PA-10000249 Office Supplies Paper Easy-staple paper 29.472 3 0.2 9.9468
56 57 CA-2017-111682 2017-06-17 2017-06-18 First Class TB-21055 Ted Butterfield Consumer United States Troy ... 12180.0 East OFF-PA-10001569 Office Supplies Paper Xerox 232 32.400 5 0.0 15.5520
58 59 CA-2017-111682 2017-06-17 2017-06-18 First Class TB-21055 Ted Butterfield Consumer United States Troy ... 12180.0 East OFF-PA-10000587 Office Supplies Paper Array Parchment Paper, Assorted Colors 14.560 2 0.0 6.9888
64 65 CA-2016-135545 2016-11-24 2016-11-30 Standard Class KM-16720 Kunst Miller Consumer United States Los Angeles ... 90004.0 West OFF-PA-10003892 Office Supplies Paper Xerox 1943 146.730 3 0.0 68.9631

5 rows × 21 columns

In [11]:
# Step2. 绘制散点图
import plotly.graph_objects as go

fig = go.Figure(go.Scatter(
    x=data['Sales'],
    y=data['Profit'],
    mode='markers',
    marker_size=12,  
    text=data['Discount'],
    opacity=0.7
))

fig.update_layout(
    title='Sales and Profit Distrubution of Paper',
    xaxis=dict(title='Sales'),
    yaxis=dict(title='Profit')
)

fig.show()
In [12]:
# 延伸;用气泡的颜色来展示不同地区(Region)的数据点
fig = go.Figure()

for i in data['Region'].unique():
    fig.add_trace(go.Scatter(
        x = data.loc[data['Region']==i,'Sales'],
        y = data.loc[data['Region']==i,'Profit'],
        mode='markers',
        name=i,
        marker_size=12,
        opacity=0.7
    ))

fig.update_layout(
    title='Sales and Profit Distrubution of Paper by Region',
    xaxis=dict(title='Sales'),
    yaxis=dict(title='Profit')
)

fig.show()

练习4. 绘制气泡图 Sales and Profit Distrubution¶

导入数据文件'Sample - Superstore.xls',绘制气泡图,展示销售额最高的前50名客户的销售额(X轴)和利润(Y轴)的关系,气泡的大小size和颜色color均体现折扣(Discount)这一变量,交互时增加显示的文本text:Customer Name和Discount(如交互所示)。

In [13]:
# Step1. 导入文件'Sample - Superstore.xls',数据分析得到销售额最高的前50名客户的销售额(Sales)、利润(Profit)和折扣(Discount)
import pandas as pd
df = pd.read_excel('Sample - Superstore.xls',sheet_name='Orders')
data = df.groupby('Customer Name').agg({'Sales':'sum','Discount':'mean','Profit':'sum'})
data = data.sort_values(by='Sales',ascending=False)
data = data[:50]
data.head()
Out[13]:
Sales Discount Profit
Customer Name
Sean Miller 25043.050 0.246667 -1980.7393
Tamara Chand 19052.218 0.116667 8981.3239
Raymond Buch 15117.339 0.094444 6976.0959
Tom Ashbrook 14595.620 0.080000 4703.7883
Adrian Barton 14473.571 0.240000 5444.8055
In [14]:
# Step3. 绘制气泡图
import plotly.graph_objects as go

text=[]
for i in range(50):
    text.append(('Customer Name: {0}<br>Discount: {1:.2%}').format(data.index[i],data.iloc[i,1]))

fig = go.Figure(go.Scatter(
    x=data['Sales'],
    y=data['Profit'],
    mode='markers',
    marker=dict(
        color=data['Discount'],
        colorscale='viridis',
        showscale=True,
        size=data['Discount'],
        sizeref=2*max(data['Discount'])/(10**2)),
    text=text,
))

fig.update_layout(
    title='Top-50 Customers\' Sales and Profit',
    xaxis_title='Sales',
    yaxis_title='Profit'
)

fig.show()

练习5. 绘制时间序列图 Sales and Profit by day¶

导入数据文件'Sample - Superstore.xls',绘制时间序列图,展示2018年每天的销售额(Sales)和利润(Profit)。

In [15]:
# 导入数据并进行数据分析
import pandas as pd
df = pd.read_excel('Sample - Superstore.xls',sheet_name='Orders')

# 每天销售额和利润
data = df.groupby('Order Date')[['Sales','Profit']].sum()
data
Out[15]:
Sales Profit
Order Date
2015-01-03 16.4480 5.5512
2015-01-04 288.0600 -65.9901
2015-01-05 19.5360 4.8840
2015-01-06 4407.1000 1358.0524
2015-01-07 87.1580 -71.9621
... ... ...
2018-12-26 814.5940 61.1202
2018-12-27 177.6360 -31.9742
2018-12-28 1657.3508 253.1188
2018-12-29 2915.5340 644.4338
2018-12-30 713.7900 101.5365

1236 rows × 2 columns

In [16]:
# 索引和切片
print(data.loc['2018'],'\n')  # 2018年的记录
print(data.loc['2018-02'],'\n')  # 2018年2月的记录
print(data['2018-12-01':'2018-12-15'])  # 切片
                Sales    Profit
Order Date                     
2018-01-01  1481.8280 -181.4109
2018-01-02  2079.5540 -207.0473
2018-01-03  2070.2720  704.2800
2018-01-06    33.7400   15.5204
2018-01-07  3395.5900  758.7192
...               ...       ...
2018-12-26   814.5940   61.1202
2018-12-27   177.6360  -31.9742
2018-12-28  1657.3508  253.1188
2018-12-29  2915.5340  644.4338
2018-12-30   713.7900  101.5365

[322 rows x 2 columns] 

                Sales    Profit
Order Date                     
2018-02-02   913.3540  170.6770
2018-02-03   922.3270  215.5700
2018-02-04    32.6700    8.4942
2018-02-05  2263.0120   74.8820
2018-02-06   904.3540  204.3158
2018-02-09   773.7640 -411.9726
2018-02-10   227.1030   28.1274
2018-02-11  1241.5160  130.1018
2018-02-13  1058.4300  424.3345
2018-02-16  1337.4420   95.9756
2018-02-17  2964.8174 -383.5478
2018-02-18   287.3260   62.4082
2018-02-19  1314.5900  377.0515
2018-02-20  1150.2900 -107.5121
2018-02-21    47.9040   -2.9940
2018-02-23   117.8000   42.3700
2018-02-24  1448.6760  249.3929
2018-02-25   430.4920  -19.3798
2018-02-26  2847.6460  447.3532
2018-02-28    17.6200    8.2242 

               Sales     Profit
Order Date                     
2018-12-01  5331.178   718.8920
2018-12-02  9951.182    -7.3410
2018-12-03  1403.842   280.7407
2018-12-04  2639.638   -21.9881
2018-12-05  1453.136   447.6235
2018-12-06    10.680     2.8836
2018-12-07  2916.514 -2686.6673
2018-12-08  7643.041  1154.6045
2018-12-09  5470.390  1487.1418
2018-12-10  3873.559   715.5696
2018-12-11  2823.965   -82.4089
2018-12-13   580.936    99.2154
2018-12-14  3897.714   215.2500
2018-12-15   306.888    52.5946
In [17]:
# 绘制时间序列图:2018年每天的销售额(Sales)和利润(Profit)
import plotly.graph_objects as go
import pandas as pd

data = df.groupby('Order Date')[['Sales','Profit']].sum()
data = data.loc['2018']

fig = go.Figure()
fig.add_trace(go.Scatter(
    x=data.index,
    y=data['Sales'],
    name='Sales'
))
fig.add_trace(go.Scatter(
    x=data.index,
    y=data['Profit'],
    name='Profit'
))

fig.update_traces(opacity=0.8)

fig.update_layout(
    title='Sales and Profit in 2018'
)

fig.show()

练习6. 绘制时间序列图 Sales and Profit by month¶

导入数据文件'Sample - Superstore.xls',绘制时间序列图,展示2018年每个月的销售额(Sales)和利润(Profit)。

对Order Date进行groupby操作后,时间戳是每天(D),如果想要将其转换为每月(M),可以通过重新采样来实现。重新采样是指将时间序列从一个频率转换为另一个频率的过程。将更高频率的数据聚合到低频率被称为向下采样,反之则称为向上采样。Pandas对象配有resample方法,与groupby方法类似,调用resample时需要对数据分组,之后再调用聚合函数。

In [18]:
# 重新采样
data = df.groupby('Order Date')[['Sales','Profit']].sum()
data = data.loc['2018'].resample('M').sum()
data
Out[18]:
Sales Profit
Order Date
2018-01-31 43971.3740 7140.4391
2018-02-28 20301.1334 1613.8720
2018-03-31 58872.3528 14751.8915
2018-04-30 36521.5361 933.2900
2018-05-31 44261.1102 6342.5828
2018-06-30 52981.7257 8223.3357
2018-07-31 45264.4160 6952.6212
2018-08-31 63120.8880 9040.9557
2018-09-30 87866.6520 10991.5556
2018-10-31 77776.9232 9275.2755
2018-11-30 118447.8250 9690.1037
2018-12-31 83829.3188 8483.3468
In [19]:
data.index.strftime('%Y-%m')  # 转换时间格式
Out[19]:
Index(['2018-01', '2018-02', '2018-03', '2018-04', '2018-05', '2018-06',
       '2018-07', '2018-08', '2018-09', '2018-10', '2018-11', '2018-12'],
      dtype='object', name='Order Date')
In [20]:
# 绘制时间序列图:2018年每个月的销售额(Sales)和利润(Profit)

fig = go.Figure()
fig.add_trace(go.Scatter(
    x=data.index.strftime('%Y-%m'),
    y=data['Sales'],
    name='Sales',
    marker_color='#ff7043'
))
fig.add_trace(go.Scatter(
    x=data.index.strftime('%Y-%m'),
    y=data['Profit'],
    name='Profit',
    marker_color='#29b6f6'
))
        
fig.update_layout(
    title='Sales and Profit in 2018',
    xaxis=dict(dtick='M1')  # X轴刻度显示为每一个月
)

fig.show()
In [ ]: